1 Packages

As usual, we will need to load some packages.

Load the tidyverse and DESeq2 packages, install them if they are not already.

require("tidyverse")
require("DESeq2")

Now, load the DESeq2 object that you saved previously using the readRDS function.

Load the Rds object containing the DESeq2 object into the “dds” variable

dds <- readRDS("./dds.Rds")

While we already applied some filtering the to count matrix that we used as an input, we can still pre-filter the dataset further. You can, in fact, treat the DESeqDataSet object (in our dds variable) as a matrix to subset it.

For example, dim(dds) will return the dimensions of the matrix (number of features or genes, and number of samples). Similarly, we can subset to the first 3 samples and first 10 genes with dds[1:10,1:3].

To access the full count matrix, you can use counts(dds), and to access the sample metadata, you use colData(dds).

2 Some more prefiltering

2.1 Pre-filtering genes

It is customary to remove genes that have very low expression, since their signal-to-noise ratio is very small to yield any biologically significant results.

Keep in the dds object only genes with at least 10 reads in more than 25% of the samples

Useful functions:

  • rowSums()
nrow(dds)
#> [1] 20180
nSamples <- ncol(dds)*.25
genes.keep <- rowSums(counts(dds) >=10) > nSamples
dds <- dds[genes.keep,]
nrow(dds)
#> [1] 17399

We went from about 20,200 genes down to about 17,400.

2.2 Sample characterization

Now we want to explore the samples in our dataset. In order to compare the columns of our matrix, however, we need to normalize the counts. The simplest way would be to take the sum of each column (i.e., the total library size of each sample) and divide the counts by the value. The DESeq2 package offers some more sophisticated methods, and we will use the “variance stabilising transformation” in this tutorial (the vst() function). The output is a DESeqTransform object, and the values calculated can be extracted from it using the assay() function:

norm_expr <- vst(dds) %>% assay

Execute the command above to obtain a matrix of normalized gene expression

norm_expr <- vst(dds) %>% assay

Before we plot the expression values we just calculated, let’s install and load the pheatmap package, used to plot heatmaps.

require("pheatmap")

Now, instead of plotting the whole normalized expression matrix that we obtained using the vst() function, let us only plot the top 1,000 most varying genes using the pheatmap() function.

Calculate the standard deviation for each gene and pick the top 1,000 genes with the highest. Plot the vst expression matrix for those genes using the pheatmap() function.

Useful functions:

  • apply()
  • sd()
  • rowSds()
  • arrange()
  • desc()
  • slice_max()
  • pheatmap()
plot.genes <- tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=1000) %>%
    pull(gene_name)
norm_expr[plot.genes, ] %>% pheatmap()

The heatmap is not that informative to compare across samples, because it only shows that some genes have a higher level of expression than others. By default, pheatmap() does not scale the data, but we can do that using scale=“row” or scale=“column”.

Plot the heatmap again but this time using scale=“row”

norm_expr[plot.genes, ] %>%
    pheatmap(scale="row")

What could the two clusters be representing? We can annotate the columns of the heatmap by using the option “annotation_col” of pheatmap(). We need to provide a data.frame with the variables to annotate. For example:

annotation_col=as.data.frame(colData(dds)[,c("condition","origin")])

Plot the same heatmap but adding annotations for the condition, the origin, the RIN values and the post-mortem time

norm_expr[plot.genes, ] %>%
    pheatmap(scale="row", annotation_col=as.data.frame(colData(dds)[,c("condition","origin", "rin", "pm_time_min")]))

The fact that the two cohorts are so different is a reason for concern. We know that the cohorts have very different RIN values and post-mortem intervals, and gene expression may reflect those differences.

2.2.1 Pairwise correlation between samples

We are going to calculate how different each pair of samples are, and then summarize each sample by its median difference to the rest of the samples.

The cor() function calculates the correlation between each pair of columns of the input matrix, and we can take advantage of it.

Use the cor() function on the subset of 1,000 most varying genes from the vst transformed gene expression data to calculate all pairwise correlations and plot the resulting correlation matrix using pheatmap().

Useful functions:

  • cor()
  • pheatmap()
Cors <- cor(norm_expr[plot.genes, ])
pheatmap(Cors)

Calculate the median correlation per sample.

Useful functions:

  • rowMedians()
  • sort()
medCors <- rowMedians(Cors)
names(medCors) <- rownames(Cors)
medCors %>% sort
#>  SL283585  SL283597  SL283590  SL283575  SL283616  SL283618  SL283570  SL283581 
#> 0.7567623 0.7572315 0.7606560 0.7871291 0.8066868 0.8169079 0.8392480 0.8453424 
#>  SL283566  SL283584  SL283576  SL283610  SL283599  SL283595  SL283596  SL283614 
#> 0.8495857 0.8548754 0.8627927 0.8654529 0.8686231 0.8698628 0.8741854 0.8835448 
#>  SL283587  SL283583  SL283617  SL283589  SL283592  SL283565  SL283605  SL283608 
#> 0.8841093 0.8846946 0.8890795 0.8904316 0.8909524 0.8911854 0.8920549 0.8924803 
#>  SL283586  SL283598  SL283600  SL283604  SL283603  SL283568  SL283593  SL283569 
#> 0.8941451 0.8952702 0.8963120 0.8989816 0.9008564 0.9013622 0.9015567 0.9016811 
#>  SL283601  SL283612  SL283572  SL283574  SL283606  SL283602  SL283578  SL283580 
#> 0.9024367 0.9046887 0.9079857 0.9086827 0.9087302 0.9100107 0.9100342 0.9122101 
#>  SL283591  SL283588  SL283579  SL283567  SL283577  SL283573  SL283594  SL283582 
#> 0.9133564 0.9152114 0.9152738 0.9155428 0.9178124 0.9234158 0.9240418 0.9274764 
#>  SL283571 
#> 0.9275740

A straightforward way to display outliers in this case is to plot a boxplot with ggplot2, which by default plots outliers as points:

ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))

Plot a ggplot2 boxplot using the command above.

ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))

We can see that four samples (SL283585, SL283597, SL283590, SL283575) are shown as outliers - they are quite different from all the rest.

Now we will plot all the samples in the first 2 principal components of a PCA. The easiest is probably to use the plotPCA() function from the DESeq2 package, which works on the result from vst(). Note that you have to run vst() again on the dds object, since plotPCA() works on its output.

Useful functions:

  • vst()
  • plotPCA()

Use the plotPCA() function to visualize the samples in the first 2 principal components of the expression data.

plotPCA(vst(dds))

Do the same, but for each cohort (origin) separately.

plotPCA(vst(dds)[,colData(dds)$origin=="PW"])

plotPCA(vst(dds)[,colData(dds)$origin=="NBB"])

Can you spot the two clusters in the NBB cohort?

The plotPCA() function from the DESeq2 package does something sneaky behind the scenes. If you check the help for the function, you’ll see that the argument “ntop” is set to 500 by default. To save time, the function carries out the PCA only on the top 500 most varying genes, similarly to what we were doing before (but using the top 1,000). What happens if we use, for example, only the top 100 instead?

Use plotPCA() with ntop=400, then ntop=200, and finally with ntop=10 for all samples.

plotPCA(vst(dds), ntop=400)

plotPCA(vst(dds), ntop=200)

plotPCA(vst(dds), ntop=10)

The fewer genes we use to construct the PCA, the more obvious the clusters become. What is going on? What could it be that separates our samples in two groups? Let’s have a look at this top 10 most varying genes.

First, load the gene descriptions that we used in the first tutorial using

geneNames <- readRDS(url("https://git.app.uib.no/neuromics/cell-composition-rna-pd/-/raw/master/Data/EnsDb.Hsapiens.v75.Rds"))

Load the geneNames.Rds file into the R session and convert it to a tibble

geneNames <- readRDS(url("https://git.app.uib.no/neuromics/cell-composition-rna-pd/-/raw/master/Data/EnsDb.Hsapiens.v75.Rds")) %>%
    as_tibble

Then, find the gene names of the top10 most varying genes in the geneNames object.

Find the top10 most varying genes in the geneNames object.

tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=10) %>%
    left_join(geneNames) %>%
    select(gene_name, seqnames) %>%
    unique

In which chromosome are most of these genes located? Can it explain the groups you obseved in the PCA?

Plot the PCA again using the plotPCA() function using ntop=10 and intgroup=“sex”.

plotPCA(vst(dds), ntop=10, intgroup="sex")

Although sex-associated gene expression is not exclusively restricted to sex chromosomes, it is of a much greater magnitude in the sex chromosomes.

2.3 Check sex assignment

We can use the strong association between sex-chromosome genes and sex to assess potential mis-labeling of samples. A quick way is to use all genes located in the Y chromosome to run a PCA.

Plot all the samples in the two principal components of the PCA of the Y-located genes, colouring the points by the sex of the sample. NOTE: use the vst() function on the whole dataset, subset after.

Ygenes <- geneNames %>% filter(seqnames == "Y", gene_biotype == "protein_coding") %>%
    pull(gene_name) %>% unique
vst(dds)[rownames(dds) %in% Ygenes,] %>% plotPCA(ntop=Inf, intgroup="sex")

As you can see, most of the variance is explained by the first principal component, as expected. It would in fact be enough to plot the values for the PC1.

correct_sign <- function(x) {
    if (median(x)<0) return(-1*x)
    else return(x)
}
pca1 <- vst(dds)[rownames(dds) %in% Ygenes,] %>% 
    assay %>%
    t %>%
    prcomp
pca1$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1) %>%
    mutate(PC1=correct_sign(PC1)) %>%
    left_join(as_tibble(colData(dds))) %>%
    ggplot(aes(x=sex, y=PC1)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(colour=sex), height=0, width=.2)
correct_sign <- function(x) {
    if (median(x)<0) return(-1*x)
    else return(x)
}
pca1 <- vst(dds)[rownames(dds) %in% Ygenes,] %>% 
    assay %>%
    t %>%
    prcomp
pca1$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1) %>%
    mutate(PC1=correct_sign(PC1)) %>%
    left_join(as_tibble(colData(dds))) %>%
    ggplot(aes(x=sex, y=PC1)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(colour=sex), height=0, width=.2)

For simplicity, we are going to remove from the dds object all genes mapped to the sex chromosomes.

Remove all X and Y genes from the dds object.

sex.genes <- geneNames %>% filter(seqnames %in% c("X","Y")) %>%
    pull(gene_name) %>% unique
dim(dds)
#> [1] 17399    49
dds <- dds[!rownames(dds) %in% sex.genes,]
dim(dds)
#> [1] 16755    49

Recalculate the pairwise correlations between samples as before with the filtered dataset. Plot the values and plot the PCA with ntop=1000

norm_expr <- vst(dds) %>% assay
top1000.genes <- tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=1000) %>%
    pull(gene_name)
Cors <- cor(norm_expr[top1000.genes, ])
medCors <- rowMedians(Cors)
names(medCors) <- rownames(Cors)
medCors %>% sort
#>  SL283597  SL283585  SL283590  SL283575  SL283616  SL283618  SL283581  SL283570 
#> 0.7627503 0.7682360 0.7782123 0.7977671 0.8229228 0.8239435 0.8447489 0.8559964 
#>  SL283566  SL283584  SL283576  SL283599  SL283610  SL283596  SL283595  SL283583 
#> 0.8576488 0.8708081 0.8778189 0.8798999 0.8831353 0.8856723 0.8859760 0.8899975 
#>  SL283614  SL283592  SL283587  SL283617  SL283565  SL283603  SL283608  SL283589 
#> 0.8914506 0.8977170 0.8985736 0.9024982 0.9032124 0.9049152 0.9060812 0.9098299 
#>  SL283605  SL283568  SL283604  SL283593  SL283586  SL283598  SL283600  SL283612 
#> 0.9100418 0.9114346 0.9129014 0.9129787 0.9129787 0.9136635 0.9140187 0.9146281 
#>  SL283602  SL283578  SL283601  SL283574  SL283606  SL283572  SL283569  SL283591 
#> 0.9175766 0.9194735 0.9209084 0.9209463 0.9216333 0.9224393 0.9234645 0.9243526 
#>  SL283580  SL283567  SL283579  SL283588  SL283594  SL283577  SL283573  SL283582 
#> 0.9244825 0.9249829 0.9256202 0.9308943 0.9318754 0.9344447 0.9352384 0.9364574 
#>  SL283571 
#> 0.9429330
ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))

plotPCA(vst(dds))

In order to plot the sample ids, we need to use the “returnData=TRUE” option, in the plotPCA() function which, instead of plotting, will then returns the data in a format that can be easily used with ggplot2, adding a layer with the labels:

ggplot(data) +
geom_text(aes(label=name), nudge_y=1.5, size=3)

We could also, of course, run the PCA ourselves using prcomp().

plotPCA(vst(dds), returnData=TRUE) %>%
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(colour=condition)) +
  geom_text(aes(label=name), nudge_y=1.5, size=3)

Evaluating the PCA plot and the heatmap of correlations, do you think it is justified to remove any of the samples?

3 Cell type estimation

Before proceeding with the differential expression analysis, since we are dealing with human brain tissue, we need to investigate what the cell composition is in our samples. This is of fundamental importance in these type of datasets because most of the variability in gene expression in explained by the cell types.

We will use a very simple approach to estimate cell types similar, in a way, to what we used to “estimate” sex: summarize gene expression of a set of selected markers as their first principal component (PC1). The marker genes will be known cell type-specific genes.

Let’s load a list of markers of cortical cell types from Neuroespresso:

cortical.markers <- read_tsv("./Data/cortical_markers.tsv")

Read the tab-separated file “cortical_markers.tsv” into a tibble

cortical.markers <- read_tsv("./Data/cortical_markers.tsv")

Now let us run a PCA (as we have done until now) but subsetting the genes to the neuronal markers. We will use the prcomp() function and extract the values of the PC1 for each sample.

Run a PCA using prcomp() only on neuronal markers. Explore the result and find where the coordinates for the first PC are.

estimate_ct <- function(object, genes, ct="PC1", rescale=TRUE) {
    object.vst <- vst(object)
    pca <- object.vst[rownames(object.vst) %in% genes,] %>%
        assay() %>%
        t() %>%
        prcomp()
    result <- as_tibble(pca$x, rownames="sample_id") %>%
        select(sample_id, PC1) %>%
        mutate(PC1=correct_sign(PC1))
    colnames(result) <- c("sample_id", ct)
    if (rescale) result[,ct] <- scale(result[,ct])[,1]
    return(result)
}
neuronal.est <- estimate_ct(dds, cortical.markers %>% filter(cell_type=="Neuron") %>% pull(gene_name), ct="Neuron")

Now we can do that for every other cortical cell type.

Calculate the first principal component for the other cell types and add them all to the metadatata of the dds object (colData(dds)).

ct.est <- lapply(unique(cortical.markers$cell_type), function(ct){
    message(paste0(ct, "..."))
    Genes <- filter(cortical.markers, cell_type==ct) %>% pull(gene_name)
    estimate_ct(dds, genes=Genes, ct=ct)
}) %>% Reduce("left_join", .)

newColData <- left_join(as_tibble(colData(dds)), ct.est, by="sample_id") %>%
    as("DataFrame")
rownames(newColData) <- newColData$sample_id
all(rownames(colData(dds)) == rownames(newColData))
#> [1] TRUE
colData(dds) <- newColData

And finally, do the same with the RNA markers for the cortical synapsome, which are not quite the same as the set of neuronal markers. These are RNAs found in the synapses, obtained from Hafner et al. 2019.

synaptosome.markers <- read_tsv("./Data/synaptosome_markers.tsv")
#syn <- qusage::read.gmt("./Data/synaptosome_symbols.gmt") %>% qdapTools::list2df() %>% as_tibble
#colnames(syn) <- c("gene_id", "cell_type")
#write_tsv(syn, "./Data/synaptosome_markers.tsv")
synaptosome.markers <- read_tsv("./Data/synaptosome_markers.tsv")

syn.est <- estimate_ct(dds, genes=unique(synaptosome.markers$gene_id), ct="synaptosome")
all(syn.est$sample_id == colData(dds)$sample_id)
#> [1] TRUE
colData(dds)$Synapses <- syn.est$synaptosome
colData(dds)
#> DataFrame with 49 rows and 16 columns
#>            sample_id      origin age_years         sex pm_time_min condition
#>          <character> <character> <numeric> <character>   <numeric>  <factor>
#> SL283595    SL283595          PW        79           M        2880   Control
#> SL283593    SL283593          PW        85           F        4320   Control
#> SL283618    SL283618          PW        88           F        3000   Control
#> SL283601    SL283601          PW        65           M        2880   Control
#> SL283565    SL283565         NBB        85           F         425   Control
#> ...              ...         ...       ...         ...         ...       ...
#> SL283592    SL283592          PW        69           M        3420      Case
#> SL283588    SL283588          PW        72           M        2880      Case
#> SL283600    SL283600          PW        78           F        1440      Case
#> SL283594    SL283594          PW        82           M        5040      Case
#> SL283590    SL283590          PW        95           M        2880      Case
#>                rin       Batch sample_id_paper  Astrocyte Endothelial Microglia
#>          <numeric> <character>     <character>  <numeric>   <numeric> <numeric>
#> SL283595       3.4           2           Ctr-1  -1.184423   -1.160479  0.955535
#> SL283593       5.9           2          Ctr-10  -0.309941    0.537182  0.469155
#> SL283618       5.6           4          Ctr-11   0.217133   -1.489182 -2.429756
#> SL283601       4.9           3          Ctr-14  -0.498226    0.213325  0.744386
#> SL283565       6.6           3          Ctr-23   0.153747    0.319363 -0.260804
#> ...            ...         ...             ...        ...         ...       ...
#> SL283592       4.5           2            PD-5 -0.8294955   -0.938985  0.201632
#> SL283588       5.6           4            PD-6 -0.0199344    0.681234  0.483412
#> SL283600       6.2           3            PD-7 -1.0950377   -0.521545  1.167569
#> SL283594       6.1           1            PD-8  0.6990216    0.729143  0.121045
#> SL283590       5.9           2            PD-9 -2.7878270   -2.481439  1.771681
#>               Oligo OligoPrecursors     Neuron  Synapses
#>           <numeric>       <numeric>  <numeric> <numeric>
#> SL283595   0.782419      -0.6619144  -0.705792 -0.902350
#> SL283593   0.532277       0.0857936   0.630432  0.461680
#> SL283618  -0.160397      -1.1894146  -0.866282 -1.078199
#> SL283601   0.244944      -0.3442277  -0.264953 -0.349270
#> SL283565  -2.747786      -0.1931905   0.123844  0.269135
#> ...             ...             ...        ...       ...
#> SL283592 -1.5972658       -1.140781 -0.4162497 -0.652125
#> SL283588  0.0917733       -0.213461  0.0183912 -0.155062
#> SL283600 -1.1012333       -0.734881 -0.6989150 -0.817925
#> SL283594 -0.2983306        0.625531  0.8981723  0.721039
#> SL283590 -0.1372109       -1.721758 -2.5800732 -2.100692

How can we assess which variables are responsible for most of the variation in gene expression? PCA is very helpful in this regards. By definition, the first principal component will maximize the variance that it explains by combining linearly the variables. In a way, it is “summarizing” gene expression by optimally projecting it into one dimension. Then, with the leftover variance not explained by PC1, it will do the same, but with the constaint that PC2 needs to be perpendicular to PC1.

The first principal components are often referred to as the main lines of variation, because they gather as much as possible, and often serve as a good “summary” of the multidimensional data.

If we want to assess whether cell type composition (or other variables) explain most of the variation, a simple way could be to test for association between variables and PCs. This is a bit involved programmatically, so here is the code:

scale.vec <- function(x) scale(x)[,1]

pca2 <- dds %>% 
    vst %>%
    assay %>%
    t %>%
    prcomp

Metadata.pca <- pca2$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1:PC5) %>%
    mutate_if(is.numeric, scale.vec) %>%
    left_join(as_tibble(colData(dds)))

Metadata.pca <- Metadata.pca %>% select(-sample_id, -origin, -sample_id_paper) %>% mutate_if(is.character, as.factor) %>%
    mutate_if(is.factor, as.integer)

allCors <- lapply(paste0("PC", 1:5), function(pc) {
    lapply(colnames(Metadata.pca)[6:ncol(Metadata.pca)], function(var){
        #message(paste0(pc, " vs ", var))
        lm(formula=as.formula(paste0(pc, "~", var)), data=Metadata.pca) %>%
            broom::tidy() %>% filter(term!="(Intercept)") %>%
            mutate(PC=pc, VAR=var)
    }) %>% Reduce("bind_rows",.)
}) %>% Reduce("bind_rows",.)

Pvals <- allCors %>% select(p.value, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=p.value)
p.mat <- as.matrix(Pvals[,-1])
rownames(p.mat) <- Pvals$PC

Estimates <- allCors %>% select(estimate, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=estimate)
e.mat <- as.matrix(Estimates[,-1])
rownames(e.mat) <- Estimates$PC

corrplot::corrplot(e.mat, p.mat=p.mat)
scale.vec <- function(x) scale(x)[,1]

pca2 <- dds %>% 
    vst %>%
    assay %>%
    t %>%
    prcomp

Metadata.pca <- pca2$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1:PC5) %>%
    mutate_if(is.numeric, scale.vec) %>%
    left_join(as_tibble(colData(dds)))

Metadata.pca <- Metadata.pca %>% select(-sample_id, -origin, -sample_id_paper) %>% mutate_if(is.character, as.factor) %>%
    mutate_if(is.factor, as.integer)

allCors <- lapply(paste0("PC", 1:5), function(pc) {
    lapply(colnames(Metadata.pca)[6:ncol(Metadata.pca)], function(var){
        #message(paste0(pc, " vs ", var))
        lm(formula=as.formula(paste0(pc, "~", var)), data=Metadata.pca) %>%
            broom::tidy() %>% filter(term!="(Intercept)") %>%
            mutate(PC=pc, VAR=var)
    }) %>% Reduce("bind_rows",.)
}) %>% Reduce("bind_rows",.)

Pvals <- allCors %>% select(p.value, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=p.value)
p.mat <- as.matrix(Pvals[,-1])
rownames(p.mat) <- Pvals$PC

Estimates <- allCors %>% select(estimate, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=estimate)
e.mat <- as.matrix(Estimates[,-1])
rownames(e.mat) <- Estimates$PC

corrplot::corrplot(e.mat, p.mat=p.mat)

Do you see anything interesting? May it be that the RNA quality (RIN) is associated with cell type composition? That can easily be tested with a quick linear regression: try predicting the RIN values with the estimated values for synaptosomal content, then try the same with the estimated neuronal content, and then with both at the same time in the model. The formulas are as follow:

  • rin ~ Synapses
  • rin ~ Neuron
  • rin ~ Synapses + Neuron

Then compare the resulting models using the summary() function on the linear fits.

Run 3 linear regressions with rin as a response variable and 1. Synapses; 2. Neuron; and 3. Synapses + Neuron as predictors. Check the adjusted R-squared values to compare the models.

# RIN
lm.syn <- lm(rin~Synapses, data=as.data.frame(colData(dds)))
lm.neu <- lm(rin~Neuron, data=as.data.frame(colData(dds)))
lm.both <- lm(rin~Neuron+Synapses, data=as.data.frame(colData(dds)))
lm.both.cohort <- lm(rin~Neuron+Synapses+origin, data=as.data.frame(colData(dds)))
lm.syn %>% summary
#> 
#> Call:
#> lm(formula = rin ~ Synapses, data = as.data.frame(colData(dds)))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.5931 -0.7244  0.0062  0.6431  2.4892 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   5.9490     0.1348  44.141  < 2e-16 ***
#> Synapses      1.2082     0.1362   8.873 1.31e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9434 on 47 degrees of freedom
#> Multiple R-squared:  0.6262, Adjusted R-squared:  0.6182 
#> F-statistic: 78.73 on 1 and 47 DF,  p-value: 1.31e-11
lm.neu %>% summary
#> 
#> Call:
#> lm(formula = rin ~ Neuron, data = as.data.frame(colData(dds)))
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.88488 -0.93143 -0.09251  0.55718  2.74936 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   5.9490     0.1552  38.343  < 2e-16 ***
#> Neuron        1.0846     0.1568   6.919 1.09e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.086 on 47 degrees of freedom
#> Multiple R-squared:  0.5046, Adjusted R-squared:  0.494 
#> F-statistic: 47.87 on 1 and 47 DF,  p-value: 1.085e-08
lm.both %>% summary
#> 
#> Call:
#> lm(formula = rin ~ Neuron + Synapses, data = as.data.frame(colData(dds)))
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.76501 -0.49485 -0.05657  0.66050  1.64005 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   5.9490     0.1230  48.378  < 2e-16 ***
#> Neuron       -1.7596     0.5442  -3.234  0.00226 ** 
#> Synapses      2.9213     0.5442   5.369 2.53e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8608 on 46 degrees of freedom
#> Multiple R-squared:  0.6954, Adjusted R-squared:  0.6822 
#> F-statistic: 52.51 on 2 and 46 DF,  p-value: 1.334e-12
lm.both.cohort %>% summary
#> 
#> Call:
#> lm(formula = rin ~ Neuron + Synapses + origin, data = as.data.frame(colData(dds)))
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.68523 -0.54675 -0.01054  0.52954  1.62786 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   6.0734     0.2152  28.222  < 2e-16 ***
#> Neuron       -1.5875     0.5989  -2.651    0.011 *  
#> Synapses      2.7036     0.6280   4.305  8.9e-05 ***
#> originPW     -0.2177     0.3082  -0.706    0.484    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8655 on 45 degrees of freedom
#> Multiple R-squared:  0.6988, Adjusted R-squared:  0.6787 
#> F-statistic: 34.79 on 3 and 45 DF,  p-value: 8.653e-12

Synaptosome expression seem to explain RIN the best. But what if we consider “origin” as well? Which is the “best” model? Check the adjusted R squared for the models or run anova() on two models to test whether a more complex model is significantly better at capturing RIN variation.

Add “origin” as a covariate to the previous models and compare the adjusted R squared between the simpler and more comples versions of the model. Even better, use the anova() function to compare the models.

anova(
    lm(rin~Synapses, data=as.data.frame(colData(dds))), 
    lm(rin~Synapses+Neuron, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Synapses, data=as.data.frame(colData(dds))), 
    lm(rin~Synapses+origin, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Neuron, data=as.data.frame(colData(dds))),
    lm(rin~Neuron+origin, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Synapses+origin, data=as.data.frame(colData(dds))),
    lm(rin~Synapses+Neuron+origin, data=as.data.frame(colData(dds)))
)
lm.final <- lm(rin~Synapses+Neuron, data=as.data.frame(colData(dds)))

You can also check the diagnostic plots by simply using the function plot() on the linear fit. Do you see a sample that you may want to exclude before running your model again? Does it make a difference?

plot(lm.final)

4 Differential gene expression

4.1 Running an differential expression analysis

With the DESeq2 package, we can use the function DESeq() on our dds object to run the whole differential expression pipeline, which will run three steps in order:

  1. estimateSizeFactors(dds): estimation of sample-specific normalization parameters
  2. estimateDispersions(dds): estimation of gene-specific dispersion parameters
  3. nbinomWaldTest(dds): negative binomial generalized linear model to calculate the desired log2-fold changes and calculation of Wald statistics

When you run DESeq, by default it will calculate the fit according to whatever model you had specified when you created the object. The model’s formula can be printed on screen using design(dds).

Let’s change the model’s formula of the object like so:

design(dds) <- ~ sex + age_years + condition

and then run DESeq() on the dds object.

Note 1: if you have manually changed something in the colData(dds), it may be that some columns are of type “character”. They should be converted to factors.

Note 2: to avoid some headaches later on, we want to be very explicit with our variable of interest (the disease status) by specifying which of the two levels (“case” or “control”) is the REFERENCE value. This can be done by using the relevel() function like so:

colData(dds)$condition <- relevel(colData(dds)$condition, ref="Control")

Run a differential expression analysis on the dds object using the DESeq() function. The model’s formula should be ~ sex + age + condition. Overwrite with the returned object the dds object.

Useful functions:

  • mutate_if()
  • is.character()
  • DESeq()
# convert to char to factor
chr_cols <- sapply(colData(dds), is.character)
colData(dds)[chr_cols] <- lapply(colData(dds)[chr_cols], as.factor)

# relevel condition just in case
colData(dds)$condition <- relevel(colData(dds)$condition, ref="Control")

design(dds) <- ~ sex + age_years + condition
dds <- DESeq(dds)

4.2 Looking at the results

The results() function extracts the log2 fold changes and p values from the dds object (as long as the DESeq function was run). By default, the function will extract the log2 fold change for the last variable in the design formula (in our case, the disease status). However, it is always good practice to explicitly specify the contrast when calling results() with the “contrast” or “name” argument. There are different ways of specifying contrasts but, for example, if we wanted to extract the differential expression results between males and females, we could use:

results(dds, name="sex_M_vs_F")

or

results(dds, contrast=c("sex", "M", "F"))

they are equivalent. A nice trick to see which contrasts can be extracted using “name=”, you can use

resultsNames(dds)

which prints all possible values for the contrast name. In our case, the possible values for the “name=” argument are: “Intercept”, “sex_M_vs_F”, “age_years”, and “condition_Case_vs_Control”

Use the results() function on the dds object to extract the differential expression between cases and controls and find out whether any genes are significant after correction.

#results(dds, name="sex_M_vs_F")
#results(dds, contrast=c("sex", "M", "F"))
res <- results(dds, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res %>% arrange(padj)

Are there any significant changes in gene expression associated with the disease?

We have seen that RIN is associated with the main lines of variation in gene expression. We should expect, then, that RIN is adding a lot of “uninteresting” variation to the dataset, i.e., a lot of noise. What happens if you add “rin” as a covariate?

You’ll need to update the formula and re-run the whole pipeline like so:

design(dds) <- ~ sex + age_years + rin + condition
dds <- DESeq(dds)
res <- results(dds, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res %>% arrange(padj)

Re-run the analyses with RIN as a covariate.

dds_rin <- dds
design(dds_rin) <- ~ sex + age_years + rin + condition
dds_rin <- DESeq(dds_rin)
res_rin <- results(dds_rin, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res_rin %>% arrange(padj)

What do you think has changed? How does the extra covariate added influence the results? What do you think would have happened if you added to the model a covariate that was highly associated with the condition of interest (i.e., “confounded”)?

Since the cohorts had different RINs and post-mortem intervals, try to run the cohorts separately and explore the results.

Run the differential expression analyses separately for each cohort.

dds_pw <- dds[,colData(dds)$origin=="PW"]
dds_pw <- DESeq(dds_pw)
results(dds_pw) %>% as_tibble(rownames="gene_name") %>% arrange(padj)
dds_nbb <- dds[,colData(dds)$origin=="NBB"]
dds_nbb <- DESeq(dds_nbb)
results(dds_nbb) %>% as_tibble(rownames="gene_name") %>% arrange(padj)

How can two cohorts exhibit such differences in their results? What does this suggest about inter-study replicability?

5 Gene set enrichment analyse

We are going to use the fgsea package to run a gene set enrichment analysis on your results.

Install the fgsea package using Bioconductor::install(fgsea) and load the package.

require('fgsea')

We can download “gmt” formatted files from MSigDB website, which can be imported into R using the gmtPathways() function from the fgsea package. I have already downloaded the KEGG database, which can be loaded into a variable like so:

KEGG_pathways <- gmtPathways("c2.cp.kegg.v2022.1.Hs.symbols.gmt")

Load the KEGG pathway list using the previous command.

KEGG_pathways <- gmtPathways("c2.cp.kegg.v2022.1.Hs.symbols.gmt")

We are going to prepare an input for the fgsea() function. We need to rank the genes according to our value of interest (in our case, it will be the “stat” column of our differential expression results, although you could use -log10(p-value) or log2FoldChange).

Create a vector with the “stat” values from the results and name the values with the corresponding gene names.

ranks <- res$stat
names(ranks) <- res$gene_name

Run gsea_res <- fgsea(KEGG_pathways, ranks) and explore the results (“ranks” is the vector of differential expression stats).

gsea_res <- fgsea(KEGG_pathways, ranks) %>% arrange(padj)
gsea_res

What is at the top? Can you run the GSEA with the results from the two alternate models (i.e., with and withouth accounting for the effect of RIN)?

Create a new vector with the stats of the appropriate results and run fgsea again.

ranks_rin <- res_rin$stat
names(ranks_rin) <- res_rin$gene_name
gsea_res_rin <- fgsea(KEGG_pathways, ranks_rin) %>% arrange(padj)
gsea_res_rin

What happens if you add Neuron or Synapses to the design model? How that does affect the GSEA results?

Run another differential expression analysis using Neuron and/or Synapses as covariates and explore the differences in the enriched pathways.

dds_neu <- dds
design(dds_neu) <- ~ sex + age_years + rin + Neuron + Synapses + condition
dds_neu <- DESeq(dds_neu)
res_neu <- results(dds_neu) %>% as_tibble(rownames="gene_name")
ranks_neu <- res_neu$stat
names(ranks_neu) <- res_neu$gene_name
gsea_res_neu <- fgsea(KEGG_pathways, ranks_neu) %>% arrange(padj)
gsea_res_neu

Can you think of a way to identify the pathways that are more susceptible to RIN and/or Neuron correction?

---
title: "Part 2: RNA-seq analysis"
author: "Gonzalo S. Nido"
date: "`r Sys.Date()`"
output:
    html_notebook:
        toc: true
        toc_float: true
        number_sections: true
        highlight: tango
---

<!--
To convert Markdown document to HTML,
R -e "rmarkdown::render('2.Rmd')"

To spell-check, spelling::spell_check_files()
-->

```{css, echo=FALSE, cache=FALSE}
q { color: Black; font-weight: bold; background-color: khaki; }
```


```{r global_options, include=FALSE, cache=FALSE}
require("knitr")
knitr::opts_chunk$set(fig.path='Figs_Rmd/', fig.height=4, fig.width=3,
                      warning=FALSE, message=FALSE,
                      echo=TRUE,
                      #echo=FALSE,
                      eval=TRUE,
                      #eval=FALSE,
                      results=TRUE,
                      #results=FALSE,
                      cache=FALSE,
                      comment="#>")
require("kableExtra")
```









# Packages

As usual, we will need to load some packages.


<q>Load the `tidyverse` and `DESeq2` packages, install them if they are not
already.</q>


```{r}
require("tidyverse")
require("DESeq2")
```

Now, load the DESeq2 object that you saved previously using the `readRDS`
function.

<q>Load the Rds object containing the DESeq2 object into the "dds" variable</q>

```{r}
dds <- readRDS("./dds.Rds")
```

While we already applied some filtering the to count matrix that we used as an
input, we can still pre-filter the dataset further. You can, in fact, treat the
DESeqDataSet object (in our `dds` variable) as a matrix to subset it.

For example, `dim(dds)` will return the dimensions of the matrix (number of
features or genes, and number of samples). Similarly, we can subset to the
first 3 samples and first 10 genes with `dds[1:10,1:3]`.

To access the full count matrix, you can use `counts(dds)`, and to access the
sample metadata, you use `colData(dds)`.

# Some more prefiltering

## Pre-filtering genes

It is customary to remove genes that have very low expression, since their
signal-to-noise ratio is very small to yield any biologically significant
results.

<q>Keep in the `dds` object only genes with at least 10 reads in more than 25%
of the samples</q>

Useful functions:

* `rowSums()`

```{r}
nrow(dds)
nSamples <- ncol(dds)*.25
genes.keep <- rowSums(counts(dds) >=10) > nSamples
dds <- dds[genes.keep,]
nrow(dds)
```

We went from about 20,200 genes down to about 17,400.


## Sample characterization

Now we want to explore the samples in our dataset. In order to compare the
columns of our matrix, however, we need to normalize the counts. The simplest
way would be to take the sum of each column (i.e., the total library size of
each sample) and divide the counts by the value. The `DESeq2` package offers
some more sophisticated methods, and we will use the "variance stabilising
transformation" in this tutorial (the `vst()` function).  The output is a
`DESeqTransform` object, and the values calculated can be extracted from it
using the `assay()` function:

    norm_expr <- vst(dds) %>% assay

<q>Execute the command above to obtain a matrix of normalized gene
expression</q>

```{r}
norm_expr <- vst(dds) %>% assay
```

Before we plot the expression values we just calculated, let's install and load
the `pheatmap` package, used to plot heatmaps.

```{r}
require("pheatmap")
```

Now, instead of plotting the whole normalized expression matrix that we
obtained using the `vst()` function, let us only plot the top 1,000 most
varying genes using the `pheatmap()` function.

<q>Calculate the standard deviation for each gene and pick the top 1,000 genes
with the highest. Plot the vst expression matrix for those genes using the
`pheatmap()` function.</q>

Useful functions:

* `apply()`
* `sd()`
* `rowSds()`
* `arrange()`
* `desc()`
* `slice_max()`
* `pheatmap()`

```{r, fig.height=12, fig.width=10}
plot.genes <- tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=1000) %>%
    pull(gene_name)
norm_expr[plot.genes, ] %>% pheatmap()
```

The heatmap is not that informative to compare across samples, because it only
shows that some genes have a higher level of expression than others. By
default, `pheatmap()` does not scale the data, but we can do that using
scale="row" or scale="column".

<q>Plot the heatmap again but this time using scale="row"</q>

```{r, fig.height=12, fig.width=10}
norm_expr[plot.genes, ] %>%
    pheatmap(scale="row")
```

What could the two clusters be representing? We can annotate the columns of the
heatmap by using the option "annotation_col" of `pheatmap()`. We need to
provide a `data.frame` with the variables to annotate. For example:

    annotation_col=as.data.frame(colData(dds)[,c("condition","origin")])

<q> Plot the same heatmap but adding annotations for the condition, the origin,
the RIN values and the post-mortem time</q>

```{r, fig.height=10, fig.width=10}
norm_expr[plot.genes, ] %>%
    pheatmap(scale="row", annotation_col=as.data.frame(colData(dds)[,c("condition","origin", "rin", "pm_time_min")]))
```

The fact that the two cohorts are so different is a reason for concern. We know
that  the cohorts have very different RIN values and post-mortem intervals, and
gene expression may reflect those differences.


### Pairwise correlation between samples

We are going to calculate how different *each pair* of samples are, and then
summarize each sample by its median difference to the rest of the samples.

The `cor()` function calculates the correlation between each pair of columns of
the input matrix, and we can take advantage of it.

<q>Use the `cor()` function on the subset of 1,000 most varying genes from the
vst transformed gene expression data to calculate all pairwise correlations and
plot the resulting correlation matrix using `pheatmap()`.</q>

Useful functions:

* `cor()`
* `pheatmap()`

```{r, fig.height=10, fig.width=10}
Cors <- cor(norm_expr[plot.genes, ])
pheatmap(Cors)
```

<q>Calculate the median correlation per sample.</q>

Useful functions:

* `rowMedians()`
* `sort()`


```{r}
medCors <- rowMedians(Cors)
names(medCors) <- rownames(Cors)
medCors %>% sort
```

A straightforward way to display outliers in this case is to plot a boxplot with
ggplot2, which by default plots outliers as points:

    ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))

<q>Plot a ggplot2 boxplot using the command above.</q>

```{r}
ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))
```

We can see that four samples (SL283585, SL283597, SL283590, SL283575) are shown
as outliers - they are quite different from all the rest.

Now we will plot all the samples in the first 2 principal components of a PCA.
The easiest is probably to use the `plotPCA()` function from the `DESeq2`
package, which works on the result from `vst()`. Note that you have to run
`vst()` again on the `dds` object, since `plotPCA()` works on its output.

Useful functions:

* `vst()`
* `plotPCA()`

<q>Use the `plotPCA()` function to visualize the samples in the first 2
principal components of the expression data.</q>

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds))
```

<q>Do the same, but for each cohort (origin) separately.</q>

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds)[,colData(dds)$origin=="PW"])
plotPCA(vst(dds)[,colData(dds)$origin=="NBB"])
```

Can you spot the two clusters in the NBB cohort?

The `plotPCA()` function from the `DESeq2` package does something sneaky behind
the scenes. If you check the help for the function, you'll see that the
argument "ntop" is set to 500 by default. To save time, the function carries
out the PCA only on the top 500 most varying genes, similarly to what we were
doing before (but using the top 1,000). What happens if we use, for example,
only the top 100 instead?

<q>Use `plotPCA()` with ntop=400, then ntop=200, and finally with ntop=10 for
all samples.</q>

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds), ntop=400)
plotPCA(vst(dds), ntop=200)
plotPCA(vst(dds), ntop=10)
```

The fewer genes we use to construct the PCA, the more obvious the clusters
become. What is going on? What could it be that separates our samples in two
groups? Let's have a look at this top 10 most varying genes.

First, load the gene descriptions that we used in the first tutorial using

    geneNames <- readRDS(url("https://git.app.uib.no/neuromics/cell-composition-rna-pd/-/raw/master/Data/EnsDb.Hsapiens.v75.Rds"))

<q>Load the geneNames.Rds file into the R session and convert it to a
tibble</q>

```{r}
geneNames <- readRDS(url("https://git.app.uib.no/neuromics/cell-composition-rna-pd/-/raw/master/Data/EnsDb.Hsapiens.v75.Rds")) %>%
    as_tibble
```

Then, find the gene names of the top10 most varying genes in the `geneNames`
object.

<q>Find the top10 most varying genes in the `geneNames` object.</q>
    
```{r}
tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=10) %>%
    left_join(geneNames) %>%
    select(gene_name, seqnames) %>%
    unique
```

In which chromosome are most of these genes located? Can it explain the groups
you obseved in the PCA?

<q>Plot the PCA again using the `plotPCA()` function using *ntop=10* and
*intgroup="sex"*.</q>

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds), ntop=10, intgroup="sex")
```

Although sex-associated gene expression is not exclusively restricted to sex
chromosomes, it is of a much greater magnitude in the sex chromosomes.

## Check sex assignment

We can use the strong association between sex-chromosome genes and sex to assess
potential mis-labeling of samples. A quick way is to use all genes located in
the Y chromosome to run a PCA.

<q>Plot all the samples in the two principal components of the PCA of the
Y-located genes, colouring the points by the sex of the sample.</q> NOTE: use
the `vst()` function on the whole dataset, subset after.

```{r, fig.height=6, fig.width=5}
Ygenes <- geneNames %>% filter(seqnames == "Y", gene_biotype == "protein_coding") %>%
    pull(gene_name) %>% unique
vst(dds)[rownames(dds) %in% Ygenes,] %>% plotPCA(ntop=Inf, intgroup="sex")
```

As you can see, most of the variance is explained by the first principal
component, as expected. It would in fact be enough to plot the values for the
PC1.

```
correct_sign <- function(x) {
    if (median(x)<0) return(-1*x)
    else return(x)
}
pca1 <- vst(dds)[rownames(dds) %in% Ygenes,] %>% 
    assay %>%
    t %>%
    prcomp
pca1$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1) %>%
    mutate(PC1=correct_sign(PC1)) %>%
    left_join(as_tibble(colData(dds))) %>%
    ggplot(aes(x=sex, y=PC1)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(colour=sex), height=0, width=.2)
```

```{r}
correct_sign <- function(x) {
    if (median(x)<0) return(-1*x)
    else return(x)
}
pca1 <- vst(dds)[rownames(dds) %in% Ygenes,] %>% 
    assay %>%
    t %>%
    prcomp
pca1$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1) %>%
    mutate(PC1=correct_sign(PC1)) %>%
    left_join(as_tibble(colData(dds))) %>%
    ggplot(aes(x=sex, y=PC1)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(colour=sex), height=0, width=.2)
```

For simplicity, we are going to remove from the `dds` object all genes mapped
to the sex chromosomes.

<q>Remove all X and Y genes from the `dds` object.</q>

```{r}
sex.genes <- geneNames %>% filter(seqnames %in% c("X","Y")) %>%
    pull(gene_name) %>% unique
dim(dds)
dds <- dds[!rownames(dds) %in% sex.genes,]
dim(dds)
```

<q>Recalculate the pairwise correlations between samples as before with the
filtered dataset. Plot the values and plot the PCA with ntop=1000</q>

```{r}
norm_expr <- vst(dds) %>% assay
top1000.genes <- tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=1000) %>%
    pull(gene_name)
Cors <- cor(norm_expr[top1000.genes, ])
medCors <- rowMedians(Cors)
names(medCors) <- rownames(Cors)
medCors %>% sort
```

```{r}
ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))
```

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds))
```

In order to plot the sample ids, we need to use the "returnData=TRUE" option,
in the `plotPCA()` function which, instead of plotting, will then returns the
data in a format that can be easily used with ggplot2, adding a layer with
the labels:

    ggplot(data) +
    geom_text(aes(label=name), nudge_y=1.5, size=3)

We could also, of course, run the PCA ourselves using `prcomp()`.

```{r, fig.height=5, fig.width=8}
plotPCA(vst(dds), returnData=TRUE) %>%
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(colour=condition)) +
  geom_text(aes(label=name), nudge_y=1.5, size=3)
```

Evaluating the PCA plot and the heatmap of correlations, do you think it is
justified to remove any of the samples?






# Cell type estimation

Before proceeding with the differential expression analysis, since we are
dealing with human brain tissue, we need to investigate what the cell
composition is in our samples. This is of fundamental importance in these type
of datasets because **most of the variability in gene expression in explained
by the cell types**.

We will use a very simple approach to estimate cell types similar, in a way, to
what we used to "estimate" sex: summarize gene expression of a set of selected
markers as their first principal component (PC1). The marker genes will be
known cell type-specific genes.

Let's load a list of markers of cortical cell types from
[Neuroespresso](https://www.eneuro.org/content/4/6/ENEURO.0212-17.2017):

    cortical.markers <- read_tsv("./Data/cortical_markers.tsv")

<q>Read the tab-separated file "cortical_markers.tsv" into a tibble </q>

```{r}
cortical.markers <- read_tsv("./Data/cortical_markers.tsv")
```

Now let us run a PCA (as we have done until now) but subsetting the genes to
the neuronal markers. We will use the `prcomp()` function and extract the
values of the PC1 for each sample.

<q>Run a PCA using `prcomp()` only on neuronal markers. Explore the result and
find where the coordinates for the first PC are.</q>

```{r}
estimate_ct <- function(object, genes, ct="PC1", rescale=TRUE) {
    object.vst <- vst(object)
    pca <- object.vst[rownames(object.vst) %in% genes,] %>%
        assay() %>%
        t() %>%
        prcomp()
    result <- as_tibble(pca$x, rownames="sample_id") %>%
        select(sample_id, PC1) %>%
        mutate(PC1=correct_sign(PC1))
    colnames(result) <- c("sample_id", ct)
    if (rescale) result[,ct] <- scale(result[,ct])[,1]
    return(result)
}
neuronal.est <- estimate_ct(dds, cortical.markers %>% filter(cell_type=="Neuron") %>% pull(gene_name), ct="Neuron")
```

Now we can do that for every other cortical cell type.

<q>Calculate the first principal component for the other cell types and add
them all to the metadatata of the dds object (`colData(dds)`).</q>

```{r}
ct.est <- lapply(unique(cortical.markers$cell_type), function(ct){
    message(paste0(ct, "..."))
    Genes <- filter(cortical.markers, cell_type==ct) %>% pull(gene_name)
    estimate_ct(dds, genes=Genes, ct=ct)
}) %>% Reduce("left_join", .)

newColData <- left_join(as_tibble(colData(dds)), ct.est, by="sample_id") %>%
    as("DataFrame")
rownames(newColData) <- newColData$sample_id
all(rownames(colData(dds)) == rownames(newColData))
colData(dds) <- newColData
```

And finally, do the same with the RNA markers for the cortical synapsome, which
are not quite the same as the set of neuronal markers. These are RNAs found in
the synapses, obtained from [Hafner et al.
2019](https://pubmed.ncbi.nlm.nih.gov/31097639/).

    synaptosome.markers <- read_tsv("./Data/synaptosome_markers.tsv")
    
```{r}
#syn <- qusage::read.gmt("./Data/synaptosome_symbols.gmt") %>% qdapTools::list2df() %>% as_tibble
#colnames(syn) <- c("gene_id", "cell_type")
#write_tsv(syn, "./Data/synaptosome_markers.tsv")
synaptosome.markers <- read_tsv("./Data/synaptosome_markers.tsv")

syn.est <- estimate_ct(dds, genes=unique(synaptosome.markers$gene_id), ct="synaptosome")
all(syn.est$sample_id == colData(dds)$sample_id)
colData(dds)$Synapses <- syn.est$synaptosome
colData(dds)
```

How can we assess which variables are responsible for most of the variation in
gene expression? PCA is very helpful in this regards. By definition, the first
principal component will maximize the variance that it explains by combining
linearly the variables. In a way, it is "summarizing" gene expression by
optimally projecting it into one dimension. Then, with the leftover variance
not explained by PC1, it will do the same, but with the constaint that PC2
needs to be perpendicular to PC1.

The first principal components are often referred to as the main lines of
variation, because they gather as much as possible, and often serve as a good
"summary" of the multidimensional data.

If we want to assess whether cell type composition (or other variables) explain
most of the variation, a simple way could be to test for association between
variables and PCs. This is a bit involved programmatically, so here is the
code:

```
scale.vec <- function(x) scale(x)[,1]

pca2 <- dds %>% 
    vst %>%
    assay %>%
    t %>%
    prcomp

Metadata.pca <- pca2$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1:PC5) %>%
    mutate_if(is.numeric, scale.vec) %>%
    left_join(as_tibble(colData(dds)))

Metadata.pca <- Metadata.pca %>% select(-sample_id, -origin, -sample_id_paper) %>% mutate_if(is.character, as.factor) %>%
    mutate_if(is.factor, as.integer)

allCors <- lapply(paste0("PC", 1:5), function(pc) {
    lapply(colnames(Metadata.pca)[6:ncol(Metadata.pca)], function(var){
        #message(paste0(pc, " vs ", var))
        lm(formula=as.formula(paste0(pc, "~", var)), data=Metadata.pca) %>%
            broom::tidy() %>% filter(term!="(Intercept)") %>%
            mutate(PC=pc, VAR=var)
    }) %>% Reduce("bind_rows",.)
}) %>% Reduce("bind_rows",.)

Pvals <- allCors %>% select(p.value, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=p.value)
p.mat <- as.matrix(Pvals[,-1])
rownames(p.mat) <- Pvals$PC

Estimates <- allCors %>% select(estimate, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=estimate)
e.mat <- as.matrix(Estimates[,-1])
rownames(e.mat) <- Estimates$PC

corrplot::corrplot(e.mat, p.mat=p.mat)
```

```{r, fig.height=4, fig.width=4}
scale.vec <- function(x) scale(x)[,1]

pca2 <- dds %>% 
    vst %>%
    assay %>%
    t %>%
    prcomp

Metadata.pca <- pca2$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1:PC5) %>%
    mutate_if(is.numeric, scale.vec) %>%
    left_join(as_tibble(colData(dds)))

Metadata.pca <- Metadata.pca %>% select(-sample_id, -origin, -sample_id_paper) %>% mutate_if(is.character, as.factor) %>%
    mutate_if(is.factor, as.integer)

allCors <- lapply(paste0("PC", 1:5), function(pc) {
    lapply(colnames(Metadata.pca)[6:ncol(Metadata.pca)], function(var){
        #message(paste0(pc, " vs ", var))
        lm(formula=as.formula(paste0(pc, "~", var)), data=Metadata.pca) %>%
            broom::tidy() %>% filter(term!="(Intercept)") %>%
            mutate(PC=pc, VAR=var)
    }) %>% Reduce("bind_rows",.)
}) %>% Reduce("bind_rows",.)

Pvals <- allCors %>% select(p.value, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=p.value)
p.mat <- as.matrix(Pvals[,-1])
rownames(p.mat) <- Pvals$PC

Estimates <- allCors %>% select(estimate, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=estimate)
e.mat <- as.matrix(Estimates[,-1])
rownames(e.mat) <- Estimates$PC

corrplot::corrplot(e.mat, p.mat=p.mat)
```

Do you see anything interesting? May it be that the RNA quality (RIN) is
associated with cell type composition? That can easily be tested with a quick
linear regression: try predicting the RIN values with the estimated values for
synaptosomal content, then try the same with the estimated neuronal content,
and then with both at the same time in the model. The formulas are as follow:

* rin ~ Synapses
* rin ~ Neuron
* rin ~ Synapses + Neuron

Then compare the resulting models using the `summary()` function on the linear
fits.

<q>Run 3 linear regressions with rin as a response variable and 1. Synapses; 2.
Neuron; and 3. Synapses + Neuron as predictors. Check the adjusted R-squared
values to compare the models.</q>

```{r}
# RIN
lm.syn <- lm(rin~Synapses, data=as.data.frame(colData(dds)))
lm.neu <- lm(rin~Neuron, data=as.data.frame(colData(dds)))
lm.both <- lm(rin~Neuron+Synapses, data=as.data.frame(colData(dds)))
lm.both.cohort <- lm(rin~Neuron+Synapses+origin, data=as.data.frame(colData(dds)))
lm.syn %>% summary
lm.neu %>% summary
lm.both %>% summary
lm.both.cohort %>% summary
```

Synaptosome expression seem to explain RIN the best. But what if we consider
"origin" as well? Which is the "best" model? Check the adjusted R squared for
the models or run `anova()` on two models to test whether a more complex model
is significantly better at capturing RIN variation.

<q>Add "origin" as a covariate to the previous models and compare the adjusted
R squared between the simpler and more comples versions of the model. Even
better, use the `anova()` function to compare the models.</q>

```{r}
anova(
    lm(rin~Synapses, data=as.data.frame(colData(dds))), 
    lm(rin~Synapses+Neuron, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Synapses, data=as.data.frame(colData(dds))), 
    lm(rin~Synapses+origin, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Neuron, data=as.data.frame(colData(dds))),
    lm(rin~Neuron+origin, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Synapses+origin, data=as.data.frame(colData(dds))),
    lm(rin~Synapses+Neuron+origin, data=as.data.frame(colData(dds)))
)
lm.final <- lm(rin~Synapses+Neuron, data=as.data.frame(colData(dds)))
```

You can also check the diagnostic plots by simply using the function `plot()`
on the linear fit. Do you see a sample that you may want to exclude before
running your model again? Does it make a difference?

```{r, fig.height=4, fig.width=5}
plot(lm.final)
```




# Differential gene expression

## Running an differential expression analysis

With the `DESeq2` package, we can use the function `DESeq()` on our `dds`
object to run the whole differential expression pipeline, which will run three
steps in order:

1. `estimateSizeFactors(dds)`: estimation of sample-specific normalization parameters
2. `estimateDispersions(dds)`: estimation of gene-specific dispersion parameters
3. `nbinomWaldTest(dds)`: negative binomial generalized linear model to calculate the desired log2-fold changes and calculation of Wald statistics

When you run `DESeq`, by default it will calculate the fit according to
whatever model you had specified when you created the object. The model's
formula can be printed on screen using `design(dds)`.

Let's change the model's formula of the object like so:

    design(dds) <- ~ sex + age_years + condition

and then run `DESeq()` on the `dds` object.

**Note 1:** if you have manually changed something in the `colData(dds)`, it may
be that some columns are of type "character". They should be converted to
factors.

**Note 2:** to avoid some headaches later on, we want to be very explicit with
our variable of interest (the disease status) by specifying which of the two
levels ("case" or "control") is the REFERENCE value. This can be done by using
the `relevel()` function like so:

    colData(dds)$condition <- relevel(colData(dds)$condition, ref="Control")

<q>Run a differential expression analysis on the `dds` object using the
`DESeq()` function. The model's formula should be __~ sex + age + condition__.
Overwrite with the returned object the `dds` object.</q>

Useful functions:

* `mutate_if()`
* `is.character()`
* `DESeq()`

```{r}
# convert to char to factor
chr_cols <- sapply(colData(dds), is.character)
colData(dds)[chr_cols] <- lapply(colData(dds)[chr_cols], as.factor)

# relevel condition just in case
colData(dds)$condition <- relevel(colData(dds)$condition, ref="Control")

design(dds) <- ~ sex + age_years + condition
dds <- DESeq(dds)
```

## Looking at the results

The `results()` function extracts the log2 fold changes and p values from the
`dds` object (as long as the `DESeq` function was run). By default, the
function will extract the log2 fold change **for the last variable in the
design formula** (in our case, the disease status). However, it is always good
practice to explicitly specify the contrast when calling `results()` with the
"contrast" or "name" argument. There are different ways of specifying contrasts
but, for example, if we wanted to extract the differential expression results
between males and females, we could use:

    results(dds, name="sex_M_vs_F")

or

    results(dds, contrast=c("sex", "M", "F"))

they are equivalent. A nice trick to see which contrasts can be extracted using
"name=", you can use

    resultsNames(dds)

which prints all possible values for the contrast name. In our case, the
possible values for the "name=" argument are: **"Intercept", "sex_M_vs_F",
"age_years", and "condition_Case_vs_Control"**

<q>Use the `results()` function on the `dds` object to extract the differential
expression between cases and controls and find out whether any genes are
significant after correction.</q>

```{r}
#results(dds, name="sex_M_vs_F")
#results(dds, contrast=c("sex", "M", "F"))
res <- results(dds, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res %>% arrange(padj)
```

Are there any significant changes in gene expression associated with the
disease? 

We have seen that RIN is associated with the main lines of variation in gene
expression. We should expect, then, that RIN is adding a lot of "uninteresting"
variation to the dataset, i.e., a lot of noise. What happens if you add "rin"
as a covariate?

You'll need to update the formula and re-run the whole pipeline like so:

```
design(dds) <- ~ sex + age_years + rin + condition
dds <- DESeq(dds)
res <- results(dds, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res %>% arrange(padj)
```

<q>Re-run the analyses with RIN as a covariate.</q>

```{r}
dds_rin <- dds
design(dds_rin) <- ~ sex + age_years + rin + condition
dds_rin <- DESeq(dds_rin)
res_rin <- results(dds_rin, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res_rin %>% arrange(padj)
```

What do you think has changed? How does the extra covariate added influence the
results? What do you think would have happened if you added to the model a
covariate that was highly associated with the condition of interest (i.e.,
"confounded")?


Since the cohorts had different RINs and post-mortem intervals, try to run the
cohorts separately and explore the results.

<q>Run the differential expression analyses separately for each cohort.</q>

```{r}
dds_pw <- dds[,colData(dds)$origin=="PW"]
dds_pw <- DESeq(dds_pw)
results(dds_pw) %>% as_tibble(rownames="gene_name") %>% arrange(padj)
dds_nbb <- dds[,colData(dds)$origin=="NBB"]
dds_nbb <- DESeq(dds_nbb)
results(dds_nbb) %>% as_tibble(rownames="gene_name") %>% arrange(padj)
```

How can two cohorts exhibit such differences in their results? What does this
suggest about inter-study replicability?



# Gene set enrichment analyse

We are going to use the `fgsea` package to run a gene set enrichment analysis
on your results.

<q>Install the `fgsea` package using `Bioconductor::install(fgsea)` and load
the package.</q>

```{r}
require('fgsea')
```

We can download "gmt" formatted files from
[MSigDB](http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) website, which
can be imported into R using the `gmtPathways()` function from the `fgsea`
package. I have already downloaded the KEGG database, which can be loaded into
a variable like so:

    KEGG_pathways <- gmtPathways("c2.cp.kegg.v2022.1.Hs.symbols.gmt")

<q>Load the KEGG pathway list using the previous command.</q>

```{r}
KEGG_pathways <- gmtPathways("c2.cp.kegg.v2022.1.Hs.symbols.gmt")
```

We are going to prepare an input for the `fgsea()` function. We need to rank
the genes according to our value of interest (in our case, it will be the
"stat" column of our differential expression results, although you could use
-log10(p-value) or log2FoldChange).

<q>Create a vector with the "stat" values from the results and name the values
with the corresponding gene names.<q>

```{r}
ranks <- res$stat
names(ranks) <- res$gene_name
```

<q>Run `gsea_res <- fgsea(KEGG_pathways, ranks)` and explore the results
("ranks" is the vector of differential expression stats).</q>

```{r}
gsea_res <- fgsea(KEGG_pathways, ranks) %>% arrange(padj)
gsea_res
```

What is at the top? Can you run the GSEA with the results from the two
alternate models (i.e., with and withouth accounting for the effect of RIN)?

<q>Create a new vector with the stats of the appropriate results and run
`fgsea` again.<q>

```{r}
ranks_rin <- res_rin$stat
names(ranks_rin) <- res_rin$gene_name
gsea_res_rin <- fgsea(KEGG_pathways, ranks_rin) %>% arrange(padj)
gsea_res_rin
```

What happens if you add Neuron or Synapses to the design model? How that does
affect the GSEA results?

<q>Run another differential expression analysis using Neuron and/or Synapses as
covariates and explore the differences in the enriched pathways.<q>

```{r}
dds_neu <- dds
design(dds_neu) <- ~ sex + age_years + rin + Neuron + Synapses + condition
dds_neu <- DESeq(dds_neu)
res_neu <- results(dds_neu) %>% as_tibble(rownames="gene_name")
ranks_neu <- res_neu$stat
names(ranks_neu) <- res_neu$gene_name
gsea_res_neu <- fgsea(KEGG_pathways, ranks_neu) %>% arrange(padj)
gsea_res_neu
```


Can you think of a way to identify the pathways that are more susceptible to
RIN and/or Neuron correction?


